Assessment validity in high-stakes Laboratory Science courses is often compromised by limited, familiar question pools that encourage rote memorization over conceptual mastery. This project implemented Generative AI (GenAI) models to rapidly produce thousands of novel, content-specific assessment questions, with prompt designs explicitly targeting specific course material and Bloom's taxonomy levels. The intervention was deployed across two distinct courses, utilizing the questions in both low-stakes (formative, unlimited-attempt) and high-stakes (summative, proctored) environments. Preliminary data showed a higher percentage of correct answers on low-stakes self-assessments than on high-stakes summative exams, suggesting GenAI's utility in maintaining assessment security while differentiating stages of learning. GenAI substantially increased assessment throughput and student engagement by keeping questions fresh and eliminating reliance on shared test banks. These preliminary findings suggest that item difficulty can be modulated via Bloom's taxonomy prompting and support GenAI as a scalable method for creating rigorous, accurate assessments, yielding a clearer picture of true student progression and mastery.
The following is a draft and has not been peer-reviewed.
Assessment is a fundamental driver of student learning in higher education, particularly within the health professions where the demonstration of clinical competence is paramount. Among the various assessment modalities, Multiple-Choice Questions (MCQs) remain the most ubiquitous format due to their efficiency in grading, objectivity, and capacity to cover broad curriculum content in a short timeframe (Zaidi et al., 2018; Scully, 2017). When constructed effectively, MCQs can support learner engagement and provide critical formative feedback that bridges the gap between didactic instruction and clinical application. However, the widespread reliance on MCQs is accompanied by persistent concerns regarding item quality, cognitive depth, and the significant resource burdens associated with maintaining valid item banks.
A primary pedagogical criticism of MCQs is their tendency to target the lower levels of Bloom’s Taxonomy - specifically rote memorization and recall - rather than the analysis, synthesis, or evaluation required for professional practice (Zahoor et al., 2023; Middlemas & Hensal, 2009). While faculty often aim to assess critical thinking, research suggests a disconnect between educator intent and item quality; questions intended to measure higher-order thinking are frequently perceived by students as tests of rote memory. This misalignment encourages surface learning strategies, such as “cramming,” rather than the development of deep conceptual schemas (Azer, 2003; Zaidi et al., 2018).
To enhance validity, educators are encouraged to utilize scenario-based stems that require the application of knowledge. However, constructing high-quality, vignette-based MCQs is a resource-intensive and technically difficult task (Hurtz et al., 2012). This challenge is compounded by the necessity of “refreshing” item banks to mitigate “item decay” - the phenomenon where questions become easier over time due to exposure (Joncas et al., 2018; Naidoo, 2023). The “item-writing bottleneck” creates a tension between the need for frequent, engaging formative assessments and the limited time faculty have to produce them (Ali et al., 2018).
The advent of Generative AI and Large Language Models (LLMs) offers a potential solution to these historical limitations. Unlike traditional template-based generation, modern LLMs utilize Natural Language Processing (NLP) to interpret complex concepts and generate human-like text, acting as a force multiplier for educators (Al Shuraiqi et al., 2024; Karahan & Emekli, 2025). Recent literature suggests that AI tools can produce hundreds of items in minutes, potentially democratizing the creation of case-based assessments that mirror authentic clinical challenges (Laupichler et al., 2024). By facilitating the rapid creation of diverse, personalized practice questions, AI has the potential to support “test-enhanced learning,” transforming assessment from a passive measurement tool into an active engagement strategy (Indran et al., 2024).
However, the efficacy of AI in enhancing assessment validity remains under scrutiny. While AI offers speed, questions persist regarding its ability to reliably generate items at specific cognitive levels. Some studies indicate that AI-generated questions may be statistically easier or less discriminating than human-authored items, potentially failing to challenge high-performing students if not carefully prompted and reviewed (Kaya et al., 2025; Laupichler et al., 2024). Consequently, there is a need to rigorously evaluate whether AI can be directed to produce items that align with higher-order cognitive domains and whether the availability of these resources translates into tangible student engagement.
To address these gaps, this study investigates the utility of Generative AI within a health professions curriculum. Specifically, this research aims to assess the difficulty of AI-generated MCQs in relation to the prompted Bloom's taxonomy level, while simultaneously evaluating course engagement through an analysis of the number of assessment attempts completed by students.
This study utilized a retrospective exploratory analysis to evaluate the efficacy of AI-generated assessment questions within a health professions curriculum. The study was conducted at an academic institution within a Medical Laboratory Science (MLS) program accredited by the National Accrediting Agency for Clinical Laboratory Sciences (NAACLS). Data were collected over one academic semester (Fall 2025). The assessment questions evaluated were integrated into the undergraduate and graduate curricula, specifically targeting two core courses: Clinical Hematology Lecture and Clinical Hematology Laboratory.
The study comprised two distinct units of analysis: the psychometric properties of AI-generated items (Aim 1) and the assessment behaviors of the student cohort (Aim 2). The cohort consisted of 13 students (n = 7 bachelor's level, n = 6 master's level) enrolled concurrently in both courses. Inclusion criteria for the final analysis comprised all completed assessment attempts (quizzes and exams) containing the AI-generated questions and recorded during the study period. Attempts marked as "incomplete," attempts deemed invalid based on completion duration, and data from students who withdrew from the program before course completion were excluded.
Assessment questions were generated using Google’s Gemini models (versions gemini-2.5-pro and gemini-3-pro-preview), accessed through Google’s AI Studio. To generate psychometrically sound assessment items, a structured prompt engineering strategy was employed (see Supplementary Appendix A for full text). The AI was assigned the persona of an expert Medical Laboratory Science assessor and directed via Chain-of-Thought prompting (Wei et al., 2022) to generate items across four specific levels of Bloom’s Taxonomy: Remember, Understand, Apply, and Analyze. The system instructions explicitly defined the required cognitive task for each level (e.g., “recall specific facts” vs. “differentiate relationships”) and provided logic for generating plausible distractors (e.g., “common clinical misconceptions”).
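Although items in this study were generated through Google AI Studio's interface, the same workflow can be sketched programmatically. The sketch below, in R (the language used for the study's analyses), posts the system instruction and a Bloom's-level prompt to the public Gemini REST API; the endpoint and JSON shape follow Google's generativelanguage API, while the model choice, environment-variable key, and truncated instruction text are illustrative assumptions rather than the study's actual tooling.

```r
# Hypothetical programmatic equivalent of the AI Studio workflow (sketch only)
library(httr2)

# Abbreviated stand-in for the full system instruction in Supplementary Appendix A
system_instruction <- "You are a question generator to assess a Medical Laboratory Science student's understanding of [insert topic]..."

resp <- request("https://generativelanguage.googleapis.com/v1beta/models/gemini-2.5-pro:generateContent") |>
  req_url_query(key = Sys.getenv("GEMINI_API_KEY")) |>  # assumes a key stored in the environment
  req_body_json(list(
    systemInstruction = list(parts = list(list(text = system_instruction))),
    contents = list(list(parts = list(list(text = "5 questions at taxonomy level Analyze"))))
  )) |>
  req_perform()

# Generated items arrive as text in the first candidate's content parts
cat(resp_body_json(resp)$candidates[[1]]$content$parts[[1]]$text)
```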
Following an initial quality check ("Stage 1") that revealed a tendency for the AI to generate significantly longer text for correct answers—a common cueing flaw—a "Stage 2" protocol was implemented. This protocol enforced a "Concise Truth" rule to constrain correct answers to be direct and concise, and an "Adjective Loading" rule to artificially lengthen distractors. The final output required the question stem, four options, comprehensive remedial feedback for all choices, and an AI-predicted difficulty rating (1 = easy to 10 = hard).
The source material ingested by the AI included lecture PowerPoint presentations (Clinical Hematology Lecture) and Laboratory Investigation procedures (Clinical Hematology Laboratory). To ensure validity, the generated items underwent a “human-in-the-loop” review process by one board-certified MLS with a Specialist in Hematology credential and 13 years of Clinical Hematology experience. Items were vetted for content accuracy, distractor plausibility, grammatical syntax, and alignment with the ingested source material. Stage 1 generated questions (N = 4,019) and Stage 2 generated questions (N = 5,844) were added to question pools associated with specific learning modules corresponding to the weekly curriculum.
Assessments were administered via the Desire2Learn (D2L) Learning Management System. The assessment structure was divided into low-stakes formative quizzes and high-stakes summative examinations. Low-stakes assessments were available at the start of a learning module and consisted of unproctored quizzes (5-16 questions). Students were permitted unlimited attempts to encourage mastery, with the highest-scoring attempt recorded in the gradebook. High-stakes assessments were administered at fixed dates and times and consisted of a single proctored attempt for module examinations (25-40 questions) and course final examinations (100 questions).
Each assessment attempt, including any subsequent attempts, was designed to deliver a random set of questions from the pool of AI-generated questions specific to the module content being assessed. For all assessments, a total time limit was set to allow 1.5 minutes per question. To facilitate learning, students received immediate feedback upon completion, including the student’s response, the correct response, and an AI-generated explanation of the rationale. Participation in the assessments was a mandatory component of the coursework.
Data were extracted directly from the D2L platform using the export features within the "Grade" option of assessments. The study focused on two primary aims: assessment of difficulty and assessment of engagement. Difficulty was calculated as aggregate performance on all questions, stratified by each question's prompted Bloom's level. Because students could encounter the same question across multiple attempts, only each student's first attempt at a given question was included. Engagement was defined as the total number of assessment attempts completed by students beyond the single mandatory attempt.
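As a minimal sketch of this first-attempt filtering and stratification, assuming a long-format D2L export `responses` with hypothetical column names (student_id, question_id, attempt_time, bloom_level, is_correct):

```r
library(dplyr)

difficulty_by_level <- responses |>
  group_by(student_id, question_id) |>
  slice_min(attempt_time, n = 1, with_ties = FALSE) |>  # keep each student's first exposure to a question
  ungroup() |>
  group_by(bloom_level) |>
  summarise(n_responses = n(),
            pct_correct = mean(is_correct) * 100)       # aggregate difficulty per Bloom's level
```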
Data cleaning was performed and statistical analyses were conducted using R (version 4.5.2). Descriptive statistics were calculated to characterize the AI-generated question pool.
To validate the efficacy of the prompt engineering interventions, two specific analyses were performed on the generated items. First, a Mann-Whitney U (Wilcoxon rank-sum) test was conducted to compare the word counts of correct answer options between Stage 1 (unconstrained) and Stage 2 (constrained). Second, a Chi-squared test of independence was used to compare the prevalence of items in which the correct answer was the longest option between Stage 1 and Stage 2, to assess the reduction of length-based cueing.
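A minimal R sketch of these two validation tests, assuming a hypothetical item-level data frame `items` with a stage label, the word count of the correct option, and a logical flag for whether the correct option is the longest:

```r
# Word counts of correct answers, Stage 1 vs. Stage 2 (Mann-Whitney / Wilcoxon rank-sum)
wilcox.test(correct_words ~ stage, data = items)

# Prevalence of the correct answer being the longest option, by stage
chisq.test(table(items$stage, items$correct_is_longest))
```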
To evaluate Aim 1 (Difficulty), difficulty was operationalized as the proportion of correct student responses. A Chi-squared Test for Trend in Proportions was conducted to determine if student success rates significantly decreased as the prompted Bloom’s Taxonomy level increased (ordinal trend). Subsequent pairwise comparisons between specific Bloom’s levels were analyzed using Chi-squared tests of independence with Bonferroni correction to adjust for multiple comparisons. Additionally, Spearman’s Rank Correlation was used to assess the relationship between the AI-predicted difficulty rating (1 - 10) and actual student performance.
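In R, these analyses map onto the following calls; the counts below are illustrative placeholders rather than the study data, and the column names in the final line are hypothetical:

```r
# Placeholder first-attempt response counts per prompted Bloom's level
n_responses <- c(Remember = 4000, Understand = 4000, Apply = 4000, Analyze = 4000)
n_correct   <- c(3228, 3144, 3032, 2928)

prop.trend.test(n_correct, n_responses)             # Chi-squared Test for Trend in Proportions
pairwise.prop.test(n_correct, n_responses,
                   p.adjust.method = "bonferroni")  # post-hoc pairwise comparisons

# Spearman correlation between AI-predicted rating (1-10) and item-level performance
cor.test(items$predicted_difficulty, items$prop_correct, method = "spearman")
```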
To evaluate Aim 2 (Engagement), descriptive statistics (median and interquartile range) were generated for the frequency distribution of attempts. Due to the small sample size (N = 13), non-parametric Spearman’s Rank Correlation was used to examine relationships between voluntary engagement frequency and final exam performance (Z-score). A p-value of < 0.05 was used to determine statistical significance.
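A corresponding sketch for Aim 2, assuming one row per student (N = 13) with hypothetical columns for voluntary attempt counts and final grades as Z-scores:

```r
median(students$extra_attempts)   # central tendency of engagement
IQR(students$extra_attempts)      # spread of engagement
cor.test(students$extra_attempts, students$final_z, method = "spearman")
```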
A total of 9,826 valid items were generated and deployed into the course curriculum. This pool comprised 4,019 items generated during Stage 1 (Unconstrained) and 5,807 items generated during Stage 2 (Constrained/Engineered).
To mitigate the tendency of Generative AI to produce conspicuously long correct answers, a "Concise Truth" rule was implemented in Stage 2. Descriptive statistics revealed a marked reduction in the length of correct answer options between stages. The median word count for correct answers decreased from 14 (IQR = 7-20) in Stage 1 to 5 (IQR = 2-8) in Stage 2. Similarly, character counts decreased from a median of 95 (IQR = 53-129) to 34 (IQR = 19-52). A Mann-Whitney test indicated this reduction was statistically significant (p < .001). This significant reduction in word count was observed consistently across all four prompted Bloom's Taxonomy levels (p < .001 for all levels).
To address the tendency for the correct answer to be the longest option, an "Adjective Loading" rule was introduced for Stage 2 distractors. In Stage 1, the correct answer was the longest option in 40.8% of items. Following the rule's application in Stage 2, this frequency decreased by 1.8 percentage points to 39.0%. A Chi-squared test indicated no statistically significant difference between the stages in the frequency of the longest option being the correct answer (χ2 = 3.031, df = 1, p = .082).
| Stage | Correct Answer NOT Longest Option | Correct Answer Longest Option |
|---|---|---|
| Stage 1 | 2381 | 1638 |
| Stage 2 | 3543 | 2264 |
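This result can be reproduced directly from the counts in the table above; R's chisq.test applies the Yates continuity correction by default for 2 × 2 tables, which yields the reported statistic:

```r
cueing <- matrix(c(2381, 1638,
                   3543, 2264),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(Stage = c("Stage 1", "Stage 2"),
                                 Correct = c("Not longest", "Longest")))
chisq.test(cueing)   # X-squared ≈ 3.031, df = 1, p ≈ .082
```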
The total item pool was evenly distributed across the targeted Bloom's Taxonomy levels: Remember (24.9%), Understand (25.3%), Apply (24.8%), and Analyze (25.0%). Of the 9,826 items generated, students were exposed to 5,748 (58.5%) unique items. Stage 1 exposure (N = 3,255) contained slightly more items at the Remember level (N = 883, 28.2%) than at the other levels - Understand (N = 789, 24.2%), Apply (N = 791, 24.3%), and Analyze (N = 792, 24.3%). Stage 2 exposure (N = 2,493) had a similar distribution: Remember (N = 737, 29.6%), Understand (N = 578, 23.1%), Apply (N = 582, 23.3%), and Analyze (N = 596, 23.9%).
Aggregate student performance demonstrated a significant inverse relationship between the prompted Bloom's Taxonomy level and the proportion of correct responses. As the cognitive level increased, the percentage of correct responses decreased: Remember (80.7%), Understand (78.6%), Apply (75.8%), and Analyze (73.2%). A Chi-squared Test for Trend in Proportions confirmed this trend was statistically significant (χ2 = 70.556, df = 1, p < .001).
Post-hoc pairwise comparisons revealed significant differences in difficulty between the Remember level and both the Apply (p < .001) and Analyze (p < .001) levels. Items at the Understand level differed significantly from those at the Analyze (p < .001) and Apply (p = .03) levels. However, the difference between Apply and Analyze did not reach statistical significance (p = .08).
When stratified by course, items generated for the Clinical Hematology Laboratory were significantly more difficult (75.2% correct) than those for the Clinical Hematology Lecture (78.7% correct; χ2 = 24.948, df = 1, p < .001). Lecture items followed the aggregate trend, with difficulty increasing significantly across Bloom's levels (χ2 = 77.77, df = 1, p < .001). Laboratory items also demonstrated a significant trend (χ2 = 8.228, df = 1, p = .004); however, pairwise comparisons indicated a significant difference only between the Understand and Analyze levels (p = .02).
For Stage 2 items, the AI provided a predicted difficulty rating (1–10). Spearman’s Rank Correlation showed a strong positive correlation between the AI’s predicted rating and the prompted Bloom’s level (ρ = .815, p < .001).
Furthermore, a Chi-squared test across the grouped difficulty ratings confirmed that actual student performance declined significantly as the AI-predicted difficulty rating increased (χ2 = 15.839, df = 5, p = .007), demonstrating concordance between AI prediction and student performance. To satisfy the expected-count assumptions of this analysis, items at difficulty ratings one and two were grouped, as were ratings of seven and greater.
| Difficulty | Exposed (N) | Correct (N) |
|---|---|---|
| 1 | 69 | 66 |
| 2 | 899 | 698 |
| 3 | 613 | 485 |
| 4 | 747 | 563 |
| 5 | 817 | 607 |
| 6 | 427 | 316 |
| 7 | 152 | 101 |
| 8 | 13 | 11 |
| 9 | 4 | 3 |
| 10 | 1 | 1 |
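The grouped analysis can be reproduced from the counts in the table above: collapsing ratings 1-2 and 7-10 and running a chi-squared test on the resulting 6 × 2 table of correct versus incorrect responses returns the reported statistic.

```r
exposed <- c("1-2" = 69 + 899, "3" = 613, "4" = 747, "5" = 817,
             "6" = 427, "7+" = 152 + 13 + 4 + 1)
correct <- c(66 + 698, 485, 563, 607, 316, 101 + 11 + 3 + 1)

chisq.test(cbind(correct, incorrect = exposed - correct))
# X-squared ≈ 15.839, df = 5, p ≈ .007
```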
A total of 2,802 voluntary attempts (defined as attempts beyond the mandatory requirement) were recorded across the cohort (N = 13). The median number of extra attempts per student was 138 (IQR = 98-258), with individual engagement ranging from 44 to 753 total extra attempts. Engagement was higher in the Lecture course (N Attempts = 1,689; Median = 94, IQR = 55-162) compared to the Laboratory course (N Attempts = 1,113; Median = 58, IQR = 43-98).
There was a strong, positive correlation between the frequency of voluntary assessment attempts and overall course performance. Spearman's Rank Correlation analysis revealed a significant relationship between the number of practice attempts and final course grades expressed as Z-scores (ρ = .638, p < .001).
This study sought to evaluate the efficacy of leveraging Generative AI to address the “item-writing bottleneck” in health professions education by assessing the validity, difficulty, and engagement potential of AI-generated MCQs. The results demonstrate that while LLMs can function as powerful “force multipliers” capable of generating vast item pools, they exhibit distinct limitations in psychometric precision and higher-order cognitive differentiation. The findings suggest that AI is highly effective at fostering student engagement through volume-based testing, but requires significant human intervention (“prompt engineering”) to mitigate inherent biases and validity threats.
A persistent criticism of AI-generated content is its tendency toward verbosity, which can introduce "test-wiseness" cues where the longest answer is frequently the correct one (Artsi et al., 2024). This study's "Stage 2" intervention successfully addressed the verbosity of correct answers, achieving a statistically significant reduction in word count through the "Concise Truth" rule. However, the attempt to counterbalance this by artificially lengthening distractors ("Adjective Loading") failed to produce a statistically significant reduction in cueing globally (p = .082). While subgroup analyses suggested efficacy at the Remember and Analyze levels, the lack of a consistent effect across all domains indicates that the AI prioritized semantic logic over rigid structural constraints. This finding aligns with recent literature suggesting that LLMs favor the "truth" and may resist generating plausible but incorrect distractors that match the complexity of the correct answer without extensive, iterative chain-of-thought prompting (Yao et al., 2025). Consequently, educators must remain vigilant during the review process, as algorithmic constraints alone may not fully eliminate psychometric flaws.
The analysis of item difficulty revealed a statistically significant, albeit shallow, gradient across Bloom's Taxonomy. While the AI successfully made Remember items easier than Analyze items, the practical difference in student performance was narrow (7.5 percentage points), and the system failed to significantly differentiate between the Apply and Analyze levels (p = .08). This suggests a "Complexity Ceiling," where the AI mimics the structure of higher-order questions (e.g., using clinical vignettes) but may not achieve the requisite increase in cognitive load.
This plateau mirrors findings by Law et al. (2025) and Al Shuraiqi et al. (2024), who noted that AI-generated items often test the retrieval of isolated facts embedded within a vignette rather than true clinical synthesis. The significant difference in difficulty between Lecture and Laboratory items further complicates this picture, suggesting that the source material provided in the prompt influences the model's ability to generate difficulty. The strong correlation between the AI's predicted difficulty rating and student performance is promising, yet it indicates the model is better at predicting how students will perform on its own logic than it is at constructing truly distinct cognitive challenges at the highest levels of the taxonomy.
The most robust finding of this study is the high volume of voluntary student engagement. With a median of 138 extra attempts per student, the availability of an unlimited, low-stakes AI question bank facilitated massive "test-enhanced learning." The strong correlation (ρ = .638) between these voluntary attempts and final course performance supports the pedagogical value of frequent formative retrieval practice (Say et al., 2022; Mui Lim & Rodger, 2010; Steinel et al., 2022; Kulasegaram & Rangachari, 2018). However, these engagement data must be interpreted with extreme caution due to the small sample size (N = 13). The correlation is susceptible to the influence of outliers - specifically, "super-users" who completed over 700 attempts. While the signal is positive, it is unclear whether the benefit derives from the quality of the AI items or simply from the quantity of time-on-task. Furthermore, it cannot be ruled out that high-performing students are simply more motivated to use available resources, rather than the resources causing the high performance.
Several limitations constrain the generalizability of these findings. First, the disparity between the number of items generated (N = 9,826) and the number of students (N = 13) results in high statistical power for item analysis but low power for student outcome analysis. Second, because items were randomized, students were exposed to a subset (N = 5,748) of the total pool; although the randomization was assumed to be uniform, it is possible that the difficulty trends observed are artifacts of the specific items exposed rather than of the entire generated corpus. Finally, the study was conducted at a single institution within a specific domain (Hematology). The "Concise Truth" and "Adjective Loading" rules may perform differently in domains requiring less descriptive nuance.
Generative AI represents a transformative tool for assessment creation, offering a solution to the labor-intensive demands of item writing. It succeeds in creating engaging, grammatically coherent content that scales to student demand. However, it is not yet a “set it and forget it” solution. The “Complexity Ceiling” and the persistence of distractor flaws highlight the necessity of a “human-in-the-loop” workflow. Educators must view AI as a drafter of content that requires expert refinement to ensure it assesses deep clinical reasoning rather than superficial recall. Future research should focus on optimizing prompts to break through the complexity ceiling and validating these findings across larger, multi-institutional cohorts.
Zaidi NL, Grob KL, Monrad SM, Kurtz JB, Tai A, Ahmed AZ, Gruppen LD, Santen SA. Pushing critical thinking skills with multiple-choice questions: does Bloom's taxonomy work? Academic Medicine. 2018 Jun 1;93(6):856-9.
Scully D. Constructing multiple-choice items to measure higher-order thinking. Practical Assessment, Research and Evaluation. 2017;22(1):1-3.
Zahoor AW, Farooqui SI, Khan A, Kazmi SA, Qamar N, Rizvi J. Evaluation of cognitive domain in objective exam of physiotherapy teaching program by using Bloom's taxonomy. Journal of Health and Allied Sciences NU. 2023 Apr;13(2):289-93.
Middlemas DA, Hensal C. Issues in selecting methods of evaluating clinical competence in the health professions: implications for athletic training education. Athletic Training Education Journal. 2009 Jul 1;4(3):109-16.
Azer SA. Assessment in a problem-based learning course: twelve tips for constructing multiple choice questions that test students' cognitive skills. Biochemistry and Molecular Biology Education. 2003 Nov;31(6):428-34.
Hurtz GM, Chinn RN, Barnhill GC, Hertz NR. Measuring clinical decision making: do key features problems measure higher level cognitive processes? Evaluation & the Health Professions. 2012 Dec;35(4):396-415.
Joncas SX, St-Onge C, Bourque S, Farand P. Re-using questions in classroom-based assessment: an exploratory study at the undergraduate medical education level. Perspectives on Medical Education. 2018 Dec;7(6):373-8.
Naidoo M. The pearls and pitfalls of setting high-quality multiple choice questions for clinical medicine. South African Family Practice. 2023;65(3).
Ali K, Zahra D, Tredwin C, McIlwaine C, Jones G. Use of progress testing in a UK dental therapy and hygiene educational program. Journal of Dental Education. 2018 Feb;82(2):130-6.
Al Shuraiqi S, Aal Abdulsalam A, Masters K, Zidoum H, AlZaabi A. Automatic generation of medical case-based multiple-choice questions (MCQs): a review of methodologies, applications, evaluation, and future directions. Big Data and Cognitive Computing. 2024 Oct 17;8(10):139.
Laupichler MC, Rother JF, Grunwald Kadow IC, Ahmadi S, Raupach T. Large language models in medical education: comparing ChatGPT- to human-generated exam questions. Academic Medicine. 2024 May;99(5):508-12.
Indran IR, Paranthaman P, Gupta N, Mustafa N. Twelve tips to leverage AI for efficient and effective medical question generation: a guide for educators using ChatGPT. Medical Teacher. 2024 Aug 2;46(8):1021-6.
Kaya M, Sonmez E, Halici A, Yildirim H, Coskun A. Comparison of AI-generated and clinician-designed multiple-choice questions in emergency medicine exam: a psychometric analysis. BMC Medical Education. 2025 Jul 1;25(1):949.
Wei J, Wang X, Schuurmans D, Bosma M, Xia F, Chi E, Le QV, Zhou D. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems. 2022 Dec 6;35:24824-37.
Yao Z, Parashar A, Zhou H, Jang WS, Ouyang F, Yang Z, Yu H. MCQG-SRefine: multiple choice question generation and evaluation with iterative self-critique, correction, and comparison feedback. In: Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers); 2025 Apr. p. 10728-10777.
Kulasegaram K, Rangachari PK. Beyond "formative": assessments to enrich student learning. Advances in Physiology Education. 2018 Mar 1;42(1):5-14.
Supplementary Appendix A

Stage 1 System Instruction
You are a question generator to assess a Medical Laboratory Science student’s understanding of [insert topic]. Questions should be associated with a level of Bloom’s taxonomy. Each question should be multiple choice with four selections. Feedback should be included that discusses why a selection response is correct and the other selection responses are incorrect.
Instruction for question generation:
Unique questions must be generated for each prompt. Questions must be generated at the requested level of Bloom's taxonomy:
Output format
Question Text: (insert question text)
A) insert BEST/CORRECT answer
B) insert DISTRACTOR/PLAUSIBLE FOIL
C) insert DISTRACTOR/PLAUSIBLE FOIL
D) insert DISTRACTOR/PLAUSIBLE FOIL
Feedback:
A) Incorrect:/Correct: insert clear and thorough reasoning
B) Incorrect:/Correct: insert clear and thorough reasoning
C) Incorrect:/Correct: insert clear and thorough reasoning
D) Incorrect:/Correct: insert clear and thorough reasoning
Difficulty Level: (provide an objective level of difficulty for the question on a scale of 1 (easy) to 10 (expert))
Difficulty Reason: (thorough/elaborate reasoning to difficulty judgement)
Prompt
5 questions at taxonomy level [insert Bloom level]
Stage 2 System Instruction
[Insert Stage 1 System Instruction with the following appended]
Instruction for option lengths (STRICT ENFORCEMENT):
The correct answer must be written as directly and concisely as possible ("Concise Truth"). Distractors must be lengthened with additional descriptive detail so that the correct answer is not the longest option ("Adjective Loading").
Output format
Difficulty Level: (provide an objective level of difficulty for the question on a scale of 1 (easy) to 10 (expert))
Difficulty Reason: (thorough/elaborate reasoning to difficulty judgement)
Prompt
5 questions at taxonomy level [insert Bloom level]